Profile callprofiler with different testcases #829
22b55de to 65972c4
ThreeMonth03
left a comment
@yungyuc Please take a look. This pull request is quite long, so I'm wondering whether there is a better way to generate the code and profile the benchmark on different platforms.
- name: make cprof
  if: runner.os == 'Linux'
  run: |
    make cprof
Profile the profiler only on Linux.
    return nullptr;
}

bool run_named_case(std::string_view label, std::size_t size, std::size_t repeat_count)
Run each type of function with different hyperparameters (size and repeat count).
std::cout << "RESULT workload=" << label
          << " operations=" << operation_count
          << " repeats=" << repeat_count
          << " workload_seconds=" << elapsed.count()
          << '\n';
This file prints the wall time of each benchmark, because the wall time cannot be obtained from gprof.
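A minimal sketch of how such a wall-time measurement and RESULT line could be produced. `measure_wall_seconds` and `format_result` are hypothetical names for illustration, not the helpers in this pull request; only the key=value layout is taken from the snippet above.

```cpp
#include <chrono>
#include <cstddef>
#include <sstream>
#include <string>
#include <string_view>

// Hypothetical helper: run `body` `repeat_count` times and return the elapsed
// wall time in seconds, measured with a monotonic clock.
template <typename Body>
double measure_wall_seconds(Body && body, std::size_t repeat_count)
{
    auto const begin = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < repeat_count; ++i)
    {
        body();
    }
    std::chrono::duration<double> const elapsed = std::chrono::steady_clock::now() - begin;
    return elapsed.count();
}

// Hypothetical helper: build one RESULT line in the same key=value layout as
// the driver output shown above (without the trailing newline).
std::string format_result(std::string_view label,
                          std::size_t operation_count,
                          std::size_t repeat_count,
                          double seconds)
{
    std::ostringstream out;
    out << "RESULT workload=" << label
        << " operations=" << operation_count
        << " repeats=" << repeat_count
        << " workload_seconds=" << seconds;
    return out.str();
}
```

`std::chrono::steady_clock` is used here because it is monotonic, so the measured interval is not affected by system clock adjustments.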
void configure_large_stack()
{
#if defined(__linux__)
    rlimit limit{};
    if (getrlimit(RLIMIT_STACK, &limit) == 0)
    {
        if (RLIM_INFINITY == limit.rlim_max || limit.rlim_cur < limit.rlim_max)
        {
            limit.rlim_cur = limit.rlim_max;
            static_cast<void>(setrlimit(RLIMIT_STACK, &limit));
        }
    }
#endif
}
Raise the stack size limit first, because the call depth may reach 50000.
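As a rough illustration of why the limit matters, the sketch below estimates the stack needed by a deep call chain and compares it with a soft limit. The per-frame size is an assumed figure (real frames depend on the compiler and flags), and all function names are hypothetical.

```cpp
#include <cstdint>
#include <sys/resource.h>

// Bytes of stack needed for `depth` nested calls, assuming every frame uses
// `frame_bytes`. The frame size is an assumption, not a measured value.
constexpr std::uint64_t required_stack_bytes(std::uint64_t depth, std::uint64_t frame_bytes)
{
    return depth * frame_bytes;
}

// True when the soft limit leaves room for the whole call chain.
constexpr bool soft_limit_is_enough(std::uint64_t soft_limit_bytes,
                                    std::uint64_t depth,
                                    std::uint64_t frame_bytes)
{
    return required_stack_bytes(depth, frame_bytes) <= soft_limit_bytes;
}

// Read the current soft stack limit in bytes. Returns 0 if the query fails
// and UINT64_MAX if the limit is RLIM_INFINITY.
inline std::uint64_t current_stack_soft_limit()
{
    rlimit limit{};
    if (getrlimit(RLIMIT_STACK, &limit) != 0)
    {
        return 0;
    }
    if (limit.rlim_cur == RLIM_INFINITY)
    {
        return UINT64_MAX;
    }
    return static_cast<std::uint64_t>(limit.rlim_cur);
}
```

For example, with an assumed 256-byte frame, a chain of depth 50000 needs 12800000 bytes, which exceeds an 8 MiB soft limit (a common default, but not guaranteed on every system).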
std::array<case_definition, 4> const case_definitions{{
    {"wide_siblings", &workload::run_wide_siblings},
    {"deep_chain", &workload::run_deep_chain},
    {"balanced_tree", &workload::run_balanced_tree},
    {"hot_name_reuse", &workload::run_hot_name_reuse},
}};
There are 4 kinds of benchmarks. They are generated by Python scripts.
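A sketch of how a label could be resolved against such a table, returning a null pointer for an unknown label. The `_sketch` types, the stand-in workload functions, and `find_case` are hypothetical; the real `case_definition` and `workload::run_*` signatures are not shown in this diff excerpt.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <string_view>

// Hypothetical stand-ins for the generated workloads; each one only reports
// the size it was asked to run.
namespace workload_sketch
{
inline std::size_t run_wide_siblings(std::size_t size) { return size; }
inline std::size_t run_deep_chain(std::size_t size) { return size; }
inline std::size_t run_balanced_tree(std::size_t size) { return size; }
inline std::size_t run_hot_name_reuse(std::size_t size) { return size; }
} // namespace workload_sketch

using case_function_sketch = std::size_t (*)(std::size_t);

struct case_definition_sketch
{
    std::string_view label;
    case_function_sketch function;
};

std::array<case_definition_sketch, 4> const case_definitions_sketch{{
    {"wide_siblings", &workload_sketch::run_wide_siblings},
    {"deep_chain", &workload_sketch::run_deep_chain},
    {"balanced_tree", &workload_sketch::run_balanced_tree},
    {"hot_name_reuse", &workload_sketch::run_hot_name_reuse},
}};

// Return the entry whose label matches, or nullptr for an unknown label.
inline case_definition_sketch const * find_case(std::string_view label)
{
    auto const it = std::find_if(
        case_definitions_sketch.begin(),
        case_definitions_sketch.end(),
        [label](case_definition_sketch const & entry) { return entry.label == label; });
    return it == case_definitions_sketch.end() ? nullptr : &*it;
}
```

A linear search is sufficient here because the table has only 4 entries.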
add_custom_command(
    OUTPUT ${CPROF_GENERATED_SOURCES}
    COMMAND "${PYTHON_EXECUTABLE}" "${CPROF_GENERATOR}"
        --output-dir "${CPROF_GENERATED_DIR}"
        --shards "${CPROF_SHARD_COUNT}"
    DEPENDS "${CPROF_GENERATOR}"
    VERBATIM
)
Generate the benchmarks.
I tried generating the functions in the cpp files with macros, but that was too slow for 50000 functions.
if __name__ == "__main__":
    main()
Script to run the executable built from profiling/cprof/callprofiler_gprof.cpp.
I'm not sure whether the cpp files should be placed in /profiling.
65972c4 to 45dc6fb
yungyuc
left a comment
Please describe what you wanted to see and summarize what you found in each of the 4 profiling cases in the opening comment. It is not obvious what you want to say with the plain profiling data.
To profile the CallProfiler is a good idea.
ThreeMonth03
left a comment
@yungyuc Please take a look. I think I've now pointed out why we need to optimize some hotspots.
Introduction
Issue #831 implies that a slow callprofiler might increase the profiling error. To monitor whether callprofiler stays fast in every case, this pull request profiles callprofiler with different testcases using gprof. It also shows in which cases the profiler is slow, and why.
Analysis
The benchmark has 4 functions and 4 hyperparameter settings, 16 combinations in total. The methodology and the detailed profiling results are in the appendix.
The 5 combinations that consume the most time are, from slowest to fastest:
wide_siblings
wide_siblings
wide_siblings
balanced_tree
deep_chain
The 3 slowest combinations all use wide_siblings(), a function with a large number of direct callees at the same call depth. This means that callprofiler is slow when profiling a function that has many direct callees. In the slowest case, it takes 6 seconds for callprofiler to profile 50000 empty direct callees.
The profiler is slow in these cases because its data structure is a trie, and finding the target child of a node takes linear time: the children of a node are stored in a std::list.
modmesh/cpp/modmesh/toggle/RadixTree.hpp, lines 90 to 95 in 5a7beb6
modmesh/cpp/modmesh/toggle/RadixTree.hpp, line 52 in 27a0019
Every node in the trie corresponds to a registered function, and a pointer tracks the node whose function is currently executing. When a registered function calls a registered callee, the profiler first finds the corresponding child node, then moves the pointer to that child. When a registered function returns, the pointer moves back to its parent.
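The traversal described above can be sketched as follows. This is a simplified model under stated assumptions, not the RadixTree code in modmesh: `SketchNode`, `SketchCallTree`, and `count_wide_sibling_comparisons` are hypothetical names, new children are assumed to be appended at the back of the list, and lookup is assumed to scan from the front.

```cpp
#include <cstddef>
#include <list>
#include <memory>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical call-tree node: children live in a std::list, so finding a
// child by name is a linear scan.
struct SketchNode
{
    std::string name;
    SketchNode * parent = nullptr;
    std::list<std::unique_ptr<SketchNode>> children;
};

class SketchCallTree
{
public:
    SketchCallTree() : m_current(&m_root) {}

    // Entering a callee: scan the current node's children for the name,
    // append a new child if it is absent, then move the cursor down.
    void enter(std::string_view name)
    {
        for (auto const & child : m_current->children)
        {
            ++m_comparisons;
            if (child->name == name)
            {
                m_current = child.get();
                return;
            }
        }
        auto node = std::make_unique<SketchNode>();
        node->name = std::string(name);
        node->parent = m_current;
        m_current->children.push_back(std::move(node));
        m_current = m_current->children.back().get();
    }

    // Returning from a callee: move the cursor back to the parent.
    void leave()
    {
        if (m_current->parent != nullptr)
        {
            m_current = m_current->parent;
        }
    }

    // Number of name comparisons performed by enter() so far.
    std::size_t comparisons() const { return m_comparisons; }

private:
    SketchNode m_root;
    SketchNode * m_current;
    std::size_t m_comparisons = 0;
};

// Call n distinct siblings from the root once and report how many name
// comparisons the lookups needed.
std::size_t count_wide_sibling_comparisons(std::size_t n)
{
    SketchCallTree tree;
    for (std::size_t i = 0; i < n; ++i)
    {
        tree.enter("callee_" + std::to_string(i));
        tree.leave();
    }
    return tree.comparisons();
}
```

In this model, the i-th new sibling is compared against the i siblings already present, so n distinct callees cost n(n-1)/2 comparisons in total: 4950 for n = 100 and 1249975000 for n = 50000. A deep chain, where every node has a single child, needs no comparisons on the first pass.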
Therefore, if a node has n registered callees, searching for all n child nodes takes quadratic time in total. For example, the first case has n = 50000 registered callees, so finding the child nodes of the n callees costs on the order of n^2 = 50000^2 = 2.5 × 10^9 comparisons.
In conclusion, if callprofiler needs to profile a function with many direct callees in real use, the child lookup in CallProfiler::start_caller() should be optimized further.
Appendix
Benchmark
To simulate real-life requirements, I designed 4 types of functions for the benchmark. Each function performs a number of operations, which are empty functions with different names.
Function types
This pull request also sweeps the number of operations in each function: 100, 1000, 10000, or 50000. Because gprof is a sampling-based profiler, the benchmark repeats the workload and resets the profiler when the number of operations is small, to obtain a more precise profiling result.
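The sweep can be sketched as a small table plus a repeat loop. The (operations, repeats) pairs are the ones listed in the appendix headings; `SweepPoint`, `run_sweep_point`, and the two callables are hypothetical, and resetting before each repetition is an assumption about the ordering.

```cpp
#include <array>
#include <cstddef>

struct SweepPoint
{
    std::size_t operations;
    std::size_t repeats;
};

// The four (operations, repeats) pairs used in this pull request.
constexpr std::array<SweepPoint, 4> sweep_points{{
    {100, 10000},
    {1000, 1000},
    {10000, 5},
    {50000, 1},
}};

// Run one sweep point. `run_once(operations)` executes the workload and
// `reset()` clears the profiler state; both are hypothetical callables.
// Returns the total number of operations executed across all repetitions.
template <typename Run, typename Reset>
std::size_t run_sweep_point(SweepPoint const & point, Run && run_once, Reset && reset)
{
    std::size_t total_operations = 0;
    for (std::size_t i = 0; i < point.repeats; ++i)
    {
        reset(); // Assumed ordering: start every repetition from a clean profiler.
        run_once(point.operations);
        total_operations += point.operations;
    }
    return total_operations;
}
```

Repeating the small cases gives the sampling profiler more run time to collect samples from; for example, the first point executes 100 × 10000 = 1000000 operations in total.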
The following data were measured on WSL2 with an Intel 13700 CPU.
Since gprof is integrated with g++, the scripts in this pull request currently support only Linux.
wide_siblings
gprof top 5: operations 100, repeats 10000
gprof top 5: operations 1000, repeats 1000
gprof top 5: operations 10000, repeats 5
gprof top 5: operations 50000, repeats 1
deep_chain
gprof top 5: operations 100, repeats 10000
gprof top 5: operations 1000, repeats 1000
gprof top 5: operations 10000, repeats 5
gprof top 5: operations 50000, repeats 1
balanced_tree
gprof top 5: operations 100, repeats 10000
gprof top 5: operations 1000, repeats 1000
gprof top 5: operations 10000, repeats 5
gprof top 5: operations 50000, repeats 1
hot_name_reuse
gprof top 5: operations 100, repeats 10000
gprof top 5: operations 1000, repeats 1000
gprof top 5: operations 10000, repeats 5
gprof top 5: operations 50000, repeats 1